Tutorial Contents

Non-Linear Curve Fit

Trace data

Improving fits

Choosing models with BIC

Remove trendline

External data

Non-Linear Curve Fit

DataView implements a general non-linear curve fitting routine adapted from that of Prof. John C. Pezzullo, whose generous help I gratefully acknowledge. The concept is that a collection of X-Y data values is presumed to follow a mathematical function (the model), such as exponential decay. Curve fitting is the process of finding the parameters of the function that enable it to best fit to the data.

In summary, the algorithm attempts to find the set of parameters that minimize the sum of squared residuals (SSR, also known as the residual sum of squares, RSS), where a residual is the difference between the actual value of a point and its value as predicted by the model with a given set of parameters. It assumes that the deviations from the model are a result of Gaussian noise. This may not be strictly true for neural data (membrane noise resulting from background synaptic activity may show strong serial correlation), but nonetheless the algorithm is often successful at finding a set of parameters that provide a good fit to the data. Please consult Prof. Pezzullo's linked web page for full details of the algorithm.
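To make the idea concrete, here is a minimal sketch in Python of the same general approach: define a model, compute the residuals, and let an optimiser search for the parameters that minimise the SSR. This uses SciPy rather than DataView's own routine, and the exponential-decay model and synthetic data below are invented purely for illustration.

    # Illustration only: SciPy's optimiser, not DataView's own routine.
    import numpy as np
    from scipy.optimize import least_squares

    def model(params, x):
        amp, tau, offset = params
        return amp * np.exp(-x / tau) + offset      # assumed model: exponential decay

    def residuals(params, x, y):
        return y - model(params, x)                 # residual = data minus model prediction

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 200)
    y = model([3.0, 2.0, 0.5], x) + rng.normal(0, 0.1, x.size)   # synthetic noisy data

    fit = least_squares(residuals, x0=[1.0, 1.0, 0.0], args=(x, y))
    ssr = np.sum(fit.fun ** 2)                      # sum of squared residuals at the optimum
    print("fitted parameters:", fit.x, "  SSR:", ssr)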

The X-Y data can come from either of two sources. They can be a time-slice section within a standard data trace, in which case the X values are the relative times within the section and increase throughout the collection with an equal step size set by the sample interval. Alternatively, they can be an independent collection of X-Y values derived from an external source, in which case the X values need not be evenly spaced, nor even in order.

Trace Data Analysis

The main window in this dialog box shows the section of the data trace that you will analyse. This is either the entire region of the main display, or, if you have placed two vertical cursors in the display, the region bracketed by those cursors. The red line plots the expression shown in the Equation edit box, and will eventually show the fitted curve.

Non-linear curve fitting requires three user inputs.

First you have to define the “model”, i.e. the equation for the function that you are trying to fit. The data clearly suggest an offset sine wave (and of course we know that this is the case from the formula shown in the trace label), but to make life more interesting, let’s fit to a cosine curve.

In this expression t is time, as in the expression used to plot the function in the first place. The a parameter defines the phase shift for the cosine curve. The b parameter (in combination with the time t) defines the frequency of the cosine wave, while the c parameter defines the vertical offset. These are the three parameters to which we wish to fit values.
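The exact text typed into the Equation box is not reproduced here, but assuming the model is equivalent to cos(b*t + a) + c, a short Python sketch of it looks like this (the time axis and starting estimates below are invented for illustration):

    import numpy as np

    def cosine_model(t, a, b, c):
        return np.cos(b * t + a) + c        # a = phase, b = frequency, c = vertical offset

    t = np.arange(0, 720, 1.0)              # hypothetical time axis, same units as the trace
    red_line = cosine_model(t, a=0.0, b=0.02, c=1.5)   # what rough initial estimates would plot
    # With a = -pi/2, b = 2*pi/360 and c = 2 the model reduces to 2 + sin(2*pi*t/360),
    # i.e. an offset sine wave like the one used to construct the trace.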

Now you have to enter estimates of the initial values for the parameters. Curve-fitting progresses more quickly, and is more likely to arrive at a realistic result, if your initial estimates are reasonable.

DataView allows you to experiment with initial values and to see the results in the red line on the display.

With these settings, with luck and a following wind, the red line will adjust to fit pretty much exactly to the data. If it heads “in the right direction” but is still not a good fit, click Curve fit again. If there is still no fit, try adjusting the values by hand to get a better initial fit.

Non-linear curve fit
The non-linear Curve fit dialog showing the fit to a noisy sine wave constructed in DataView.

The parameters in the edit boxes have adjusted to reflect the fitted values. The offset parameter c is extremely close to the formula value of 2. The phase parameter a is very nearly equal to -π/2 (-1.5708), which, converted from radians to degrees, gives -90°, the correct phase shift to convert a cosine into a sine. The frequency parameter b is very nearly equal to 2π/360 (0.017453), which is what we used to generate the curve in the first place.

At the bottom of the dialog box is a display showing the RMS, which is a goodness-of-fit measure. It is normalised so that if the S.D. parameter is accurate, then an RMS value close to 1 indicates a good fit between the data and the model (i.e. deviations from the model prediction are in line with those that would be expected from the noise in the data), while values much greater than 1 indicate a poor fit. The p value indicates the probability that deviations from the fitted curve as large as those found in the data could have arisen by chance (so low p values indicate that the data are not a good fit to the model).

The accuracy of the RMS and p values is highly dependent on the accuracy of the noise estimate entered as the Data S.D. above. If you just enter 1, then the RMS is simply the square root of the mean squared residual (i.e. √(SSR/n)), which is the population standard deviation of the residuals. If you are sure that your choice of model is correct then you could use this as the S.D. value and re-fit to find a new p value for the goodness of fit. But of course if you do this you could not use a good fit as quantitative evidence that the model is correct – that would be a circular argument.
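The following sketch shows one common way of computing a normalised RMS and a goodness-of-fit p value from the residuals, a parameter count and a noise estimate. DataView's exact formulation may differ, so treat it as illustrative only.

    import numpy as np
    from scipy.stats import chi2

    def goodness_of_fit(residuals, n_params, data_sd):
        n = residuals.size
        ssr = np.sum(residuals ** 2)
        rms = np.sqrt(ssr / n) / data_sd     # ~1 if the scatter matches the noise estimate
        chisq = ssr / data_sd ** 2           # chi-square statistic for the fit
        p = chi2.sf(chisq, df=n - n_params)  # chance of deviations at least this large
        return rms, p

    # With data_sd = 1 the rms is just sqrt(SSR/n), the raw RMS residual.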

The curve fit process works by iteratively adjusting the parameters so as to minimize the difference between the data and the red line. The iterations stop either when the number of iterations exceeds that set in the Max iter edit box, or when the Convergence criterion is met. The latter occurs when the greatest percentage change in any parameter between successive iterations falls below the criterion value; when parameters are changing by only very small amounts between iterations, it is unlikely that further iterations will significantly improve the fit.
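A schematic of those two stopping rules (not DataView's actual code; the step function that performs one iteration of the fitter is left as a placeholder):

    import numpy as np

    def converged(old_params, new_params, criterion_percent):
        # greatest percentage change in any parameter between successive iterations
        denom = np.where(old_params != 0, np.abs(old_params), 1.0)
        change = np.abs(new_params - old_params) / denom * 100.0
        return np.max(change) < criterion_percent

    def fit_loop(step_fn, params, max_iter=10, criterion_percent=0.1):
        params = np.asarray(params, dtype=float)
        for _ in range(max_iter):
            new_params = np.asarray(step_fn(params), dtype=float)   # one iteration of the fitter
            if converged(params, new_params, criterion_percent):
                return new_params, True      # convergence criterion met
            params = new_params
        return params, False                 # stopped by the Max iter limit instead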

Sometimes the curve fitting process just does not work; the red line may hover around some function shape that is obviously wrong, or may “blow up” and yield ridiculous shapes. This is particularly likely if your initial parameter estimates are wildly wrong.

Improving fits

If the fit obviously fails, the first thing to do is to check whether the model itself may be totally wrong. If you are happy with the model, try to improve the initial estimates of parameters that you enter. Get the red line fitting as closely as possible to the data before you click the Curve fit button. If the fit still fails, try reducing the Step size parameter. This will reduce the speed of fitting, but may well head off errors before they become critical. If the fit looks like it is working but is still not very good after 10 iterations, then just click Curve fit again. Alternatively, increase the Max iter parameter to a larger number. A useful strategy is to start curve fitting with a small Step size (e.g. 0.1) but only allow 10 iterations. If the fit looks like it is heading in the right direction, then increase the number of iterations to a large number (e.g. 100), and wait for convergence.

Choosing Between Models with BIC

In most real-world situations we do not know for sure what underlying process produces a particular waveform.

We now have only a fragment of the sine wave visible, and without prior knowledge we might be tempted to try fitting a 3rd-order polynomial to the data.

We get a reasonable fit to the curve, with an RMS value close to 1 and a non-significant p value. However, note that the fit does not meet the convergence criterion.

In general, if choosing between alternative models which are apparently equally plausible, one should choose the model which generates the lowest (most negative) BIC value for the same set of data. In this case we know that the data were generated from a sine wave, and so the relative BIC value has led us to the correct decision.

BIC – the Bayesian Information Criterion

You may have already come across the BIC in terms of fitting histogram models by maximum likelihood, but now we consider it in the context of least squares fitting. If we have various alternative models that could fit a set of data, how do we decide which one provides the “best” description of the data? A standard metric for measuring goodness-of-fit is the sum of squared residuals (SSR) as described above. The lower the SSR, the better the fit. Indeed, the curve fitting algorithm used in DataView works by finding the set of parameters that minimize the SSR.

At first sight this would seem to solve the problem – just select the model with the lowest SSR. However, this does not take into account the complexity of the model. If we make the model complex enough, then we can make it produce an exact fit to any data. However, Occam’s Razor suggests that we should prefer a simple model with just a few parameters that produces a reasonable fit, over one with a thousand parameters, even if the latter produces a somewhat better fit. The Bayesian Information Criterion (BIC) formalizes this intuition, and provides a mathematical basis for deciding on the trade-off between complexity and goodness-of-fit. It is based on information theory, and I certainly don’t understand the full theoretical basis, but its formulation is quite simple.

         BIC = n log(SSR/n) + k log(n)
where n is the number of data points, and k is the number of parameters. SSR/n is an estimate of the variance of the noise.
We can see that having a poorer fit and hence a larger SSR produces a larger BIC, and that having more parameters also produces a larger BIC. The formulation above determines the trade-off between these two influences. If we choose the model with the lower BIC, we will have chosen the better model.
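As a worked example, here is the formula applied to two hypothetical fits of the same 500-point data set. The SSR values are invented, but they show how a slightly better fit can still lose to a simpler model:

    import numpy as np

    def bic(ssr, n, k):
        return n * np.log(ssr / n) + k * np.log(n)

    n = 500                                   # number of data points (invented)
    bic_sine = bic(ssr=12.30, n=n, k=3)       # offset cosine model: 3 parameters
    bic_cubic = bic(ssr=12.25, n=n, k=4)      # 3rd-order polynomial: 4 parameters
    print(f"sine BIC = {bic_sine:.1f}, cubic BIC = {bic_cubic:.1f}")
    # The cubic fits slightly better (lower SSR) but pays a penalty for its extra
    # parameter, so the sine model has the lower (more negative) BIC and is preferred.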

BIC and Curve-Fit Caveats

Remove the Trendline

Sometimes it is the residuals (i.e. the deviations from the fitted curve) that contain the information of interest. This would be the case if a recording was contaminated by some regular shift in baseline, such as an exponential drop over the course of a recording (but see Remove exponential drop for a specialist routine for calcium imaging experiments). The Curve Fit dialog allows you to construct a trace showing just the residuals.

Assuming that you have already fitted a curve to the noisy sine data as described above:

When the file loads, the second trace shows just the deviations of the data in trace one from the fitted curve. In this case the deviations are just random noise, but in a real experiment they might well be informative.
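In terms of the arithmetic, the residual trace is simply the original data minus the fitted curve evaluated at the same time points. A self-contained sketch (with invented data and the fitted values reported above) might look like this, although DataView builds the trace for you:

    import numpy as np

    def cosine_model(t, a, b, c):
        return np.cos(b * t + a) + c

    t = np.arange(0, 720, 1.0)                            # hypothetical time axis
    rng = np.random.default_rng(2)
    data = 2 + np.sin(2 * np.pi * t / 360) + rng.normal(0, 0.2, t.size)   # noisy sine trace

    a_fit, b_fit, c_fit = -np.pi / 2, 2 * np.pi / 360, 2.0   # fitted values from the dialog
    residual_trace = data - cosine_model(t, a_fit, b_fit, c_fit)          # the second trace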

External Data Analysis

As well as fitting a curve to a data trace (or section of a trace: see Trace Analysis above) DataView can also fit a curve to data imported from another program, or from an analysis routine within DataView itself. A full description of the process and its controlling options is given above following the Trace Analysis section, but here are some key differences:

  1. Data must be pasted from the clipboard or loaded from a file; they are not automatically read from a trace. In either case the format is simple: the data should be in two tab- or comma-separated columns, with the left column containing the X values and the right column containing the Y values. There can be a text header row, but this is ignored (a minimal parsing sketch is given after this list).
  2. The X values of the data do not have to be evenly spaced.
  3. The data do not have to be in X-order, although the graphical display can be simpler to interpret if they are.
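Here is the parsing sketch referred to in point 1: a minimal Python routine that reads a two-column, tab- or comma-separated file into X and Y arrays, skips a text header if one is present, and sorts the points into ascending X order (which, as noted below, is needed only for a tidy display, not for the fit). The file name in the usage line is hypothetical.

    import numpy as np

    def load_xy(path):
        xs, ys = [], []
        with open(path) as f:
            for line in f:
                parts = line.replace(",", "\t").split("\t")   # accept tabs or commas
                try:
                    x, y = float(parts[0]), float(parts[1])
                except (ValueError, IndexError):
                    continue                     # skip a text header row or blank line
                xs.append(x)
                ys.append(y)
        order = np.argsort(xs)                   # ascending X order, for display only
        return np.asarray(xs)[order], np.asarray(ys)[order]

    x, y = load_xy("external_data.txt")          # hypothetical file name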

You can now see the X-Y values from the external source in black, plus a red line showing the fit to the data using the default equation and parameters (these are just holding values and have no relevance to the current tutorial).

Because the rows are not sorted in X value order, the lines connecting the data points are jagged. While this does not matter for curve fitting, it makes it more difficult to see the trend in the data.

The rows are now sorted in ascending X (time) order.

Cooling approximately follows a declining exponential time course (Newton's law of cooling) which can be expressed thus:

(a - b) * exp(x / -c) + b

where a is the initial temperature, b is the environmental temperature towards which the object cools, and c is a time constant.

Examination of the raw data shows that the starting temperature is about 200 (the temperature is in degrees Fahrenheit, so this is near boiling), the ending temperature is about 60 (a coolish room), and the time constant looks to be about 15 minutes (that's about a quarter of the way across the graph).

The red fit line now lines up quite closely with the black raw data line.

The red fit line rapidly adjusts to a very close fit to the black data line. The numerical values of the parameters can be read from the a, b and c parameter boxes.
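As an outside check on the same calculation, the cooling model can be fitted in Python with SciPy's curve_fit, using the eyeball estimates above (a ≈ 200, b ≈ 60, c ≈ 15) as starting values. The data file name here is hypothetical, and exp(-x/c) is just another way of writing the exp(x / -c) in the expression above.

    import numpy as np
    from scipy.optimize import curve_fit

    def cooling(x, a, b, c):
        return (a - b) * np.exp(-x / c) + b      # Newton's law of cooling

    # hypothetical whitespace-separated file; adjust delimiter/skiprows to suit
    x, y = np.loadtxt("cooling_data.txt", unpack=True)
    popt, pcov = curve_fit(cooling, x, y, p0=[200.0, 60.0, 15.0])
    a_fit, b_fit, c_fit = popt
    print(f"start {a_fit:.1f} °F, room {b_fit:.1f} °F, time constant {c_fit:.1f} min")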

Details about the other controls and parameters were given earlier, so you can go back and read them if you have not already done so.